{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.3 实践：基于Softmax回归完成鸢尾花分类任务\n",
    "\n",
    "在本节，我们用入门深度学习的基础实验之一“鸢尾花分类任务”来进行实践，使用经典学术数据集Iris作为训练数据，实现基于Softmax回归的鸢尾花分类任务。\n",
    "\n",
    "实践流程主要包括以下7个步骤：数据处理、模型构建、损失函数定义、优化器构建、模型训练、模型评价和模型预测等，\n",
    "\n",
    "* 数据处理：根据网络接收的数据格式，完成相应的预处理操作，保证模型正常读取；\n",
    "* 模型构建：定义Softmax回归模型类；\n",
    "* 训练配置：训练相关的一些配置，如：优化算法、评价指标等；\n",
    "* 组装Runner类：Runner用于管理模型训练和测试过程；\n",
    "* 模型训练和测试：利用Runner进行模型训练、评价和测试。\n",
    "\n",
    "****\n",
    "**说明：**\n",
    "\n",
    "使用深度学习进行实践时的操作流程基本一致，后文不再赘述。\n",
    "\n",
    "****\n",
    "\n",
    "本实践的主要配置如下：\n",
    "\n",
    "* 数据：Iris数据集；\n",
    "* 模型：Softmax回归模型；\n",
    "* 损失函数：交叉熵损失；\n",
    "* 优化器：梯度下降法；\n",
    "* 评价指标：准确率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### 3.3.1 数据处理\n",
    "\n",
    "#### 3.3.1.1 数据集介绍\n",
    "\n",
    "Iris数据集，也称为鸢尾花数据集，包含了3种鸢尾花类别（Setosa、Versicolour、Virginica），每种类别有50个样本，共计150个样本。其中每个样本中包含了4个属性：花萼长度、花萼宽度、花瓣长度以及花瓣宽度，本实验通过鸢尾花这4个属性来判断该样本的类别。\n",
    "\n",
    "\n",
    "**鸢尾花属性**\n",
    "<center>\n",
    "  \n",
    "|属性1   |属性2   |属性3    |属性4    |\n",
    "|:--:    |:--:   |:--:    |:--:     |\n",
    "|sepal_length |sepal_width|petal_length|petal_width|\n",
    "|花萼长度      |花萼宽度|花瓣长度|花瓣宽度|\n",
    "</center>\n",
    "\n",
    "\n",
    "**鸢尾花类别**\n",
    "<center>\n",
    "  \n",
    "|英文名            |中文名      |标签    |\n",
    "|:--:             |:--:        |:--:    |\n",
    "|Setosa Iris      |狗尾草鸢尾   |  0\n",
    "|Versicolour Iris |杂色鸢尾     |  1\n",
    "|Virginica Iris   |弗吉尼亚鸢尾  |  2\n",
    "  \n",
    "</center>\n",
    "\n",
    "**鸢尾花属性类别对应预览**\n",
    "\n",
    "<center>\n",
    "  \n",
    "|sepal_length |sepal_width|petal_length|petal_width| species \n",
    "|:--:         |:--:       |:--:        |:--:       |:--:       |\n",
    "|5.1\t      |3.5\t      |1.4\t       |0.2\t       |setosa\n",
    "|4.9\t      |3\t      |1.4\t       |0.2\t       |setosa\n",
    "|4.7\t      |3.2\t      |1.3\t       |0.2\t       |setosa\n",
    "|...          |...        |...         |...        |...        |\n",
    "\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 3.3.1.2 数据清洗\n",
    "\n",
    "* **缺失值分析**\n",
    "\n",
    "对数据集中的缺失值或异常值等情况进行分析和处理，保证数据可以被模型正常读取。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import pandas\n",
    "import numpy as np\n",
    "\n",
    "iris_features = np.array(load_iris().data, dtype=np.float32)\n",
    "iris_labels = np.array(load_iris().target, dtype=np.int32)\n",
    "print(pandas.isna(iris_features).sum())\n",
    "print(pandas.isna(iris_labels).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "从输出结果看，鸢尾花数据集中不存在缺失值的情况。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "* **异常值处理**\n",
    "\n",
    "通过箱线图直观的显示数据分布，并观测数据中的异常值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 1000x1000 with 4 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt #可视化工具\n",
    "\n",
    "# 箱线图查看异常值分布\n",
    "def boxplot(features):\n",
    "    feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n",
    "\n",
    "    # 连续画几个图片\n",
    "    plt.figure(figsize=(5, 5), dpi=200)\n",
    "    # 子图调整\n",
    "    plt.subplots_adjust(wspace=0.6)\n",
    "    # 每个特征画一个箱线图\n",
    "    for i in range(4):\n",
    "        plt.subplot(2, 2, i+1)\n",
    "        # 画箱线图\n",
    "        plt.boxplot(features[:, i], \n",
    "                    showmeans=True, \n",
    "                    whiskerprops={\"color\":\"#E20079\", \"linewidth\":0.4, 'linestyle':\"--\"},\n",
    "                    flierprops={\"markersize\":0.4},\n",
    "                    meanprops={\"markersize\":1})\n",
    "        # 图名\n",
    "        plt.title(feature_names[i], fontdict={\"size\":5}, pad=2)\n",
    "        # y方向刻度\n",
    "        plt.yticks(fontsize=4, rotation=90)\n",
    "        plt.tick_params(pad=0.5)\n",
    "        # x方向刻度\n",
    "        plt.xticks([])\n",
    "    plt.savefig('ml-vis.pdf')\n",
    "    plt.show()\n",
    "\n",
    "boxplot(iris_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "从输出结果看，数据中基本不存在异常值，所以不需要进行异常值处理。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 3.3.1.3 数据读取\n",
    "\n",
    "本实验中将数据集划分为了三个部分：\n",
    "\n",
    "* 训练集：用于确定模型参数；\n",
    "* 验证集：与训练集独立的样本集合，用于使用提前停止策略选择最优模型；\n",
    "* 测试集：用于估计应用效果（没有在模型中应用过的数据，更贴近模型在真实场景应用的效果）。\n",
    "\n",
    "在本实验中，将$80\\%$的数据用于模型训练，$10\\%$的数据用于模型验证，$10\\%$的数据用于模型测试。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "W0711 14:19:02.852123  1587 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2\n",
      "W0711 14:19:02.855590  1587 device_context.cc:465] device: 0, cuDNN Version: 8.2.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X shape:  [150, 4] y shape:  [150]\n"
     ]
    }
   ],
   "source": [
    "import copy\n",
    "import paddle \n",
    "\n",
    "# 加载数据集\n",
    "def load_data(shuffle=True):\n",
    "    \"\"\"\n",
    "    加载鸢尾花数据\n",
    "    输入：\n",
    "        - shuffle：是否打乱数据，数据类型为bool\n",
    "    输出：\n",
    "        - X：特征数据，shape=[150,4]\n",
    "        - y：标签数据, shape=[150]\n",
    "    \"\"\"\n",
    "    # 加载原始数据\n",
    "    X = np.array(load_iris().data, dtype=np.float32)\n",
    "    y = np.array(load_iris().target, dtype=np.int32)\n",
    "\n",
    "    X = paddle.to_tensor(X)\n",
    "    y = paddle.to_tensor(y)\n",
    "\n",
    "    # 数据归一化\n",
    "    X_min = paddle.min(X, axis=0)\n",
    "    X_max = paddle.max(X, axis=0)\n",
    "    X = (X-X_min) / (X_max-X_min)\n",
    "\n",
    "    # 如果shuffle为True，随机打乱数据\n",
    "    if shuffle:\n",
    "        idx = paddle.randperm(X.shape[0])\n",
    "        X = X[idx]\n",
    "        y = y[idx]\n",
    "    return X, y\n",
    "\n",
    "# 固定随机种子\n",
    "paddle.seed(102)\n",
    "\n",
    "num_train = 120\n",
    "num_dev = 15\n",
    "num_test = 15\n",
    "\n",
    "X, y = load_data(shuffle=True)\n",
    "print(\"X shape: \", X.shape, \"y shape: \", y.shape)\n",
    "X_train, y_train = X[:num_train], y[:num_train]\n",
    "X_dev, y_dev = X[num_train:num_train + num_dev], y[num_train:num_train + num_dev]\n",
    "X_test, y_test = X[num_train + num_dev:], y[num_train + num_dev:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X_train shape:  [120, 4] y_train shape:  [120]\n"
     ]
    }
   ],
   "source": [
    "# 打印X_train和y_train的维度\n",
    "print(\"X_train shape: \", X_train.shape, \"y_train shape: \", y_train.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensor(shape=[5], dtype=int32, place=CUDAPlace(0), stop_gradient=True,\n",
      "       [0, 1, 1, 0, 1])\n"
     ]
    }
   ],
   "source": [
    "# 打印前5个数据的标签\n",
    "print(y_train[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### 3.3.2 模型构建\n",
    "\n",
    "使用Softmax回归模型进行鸢尾花分类实验，将模型的输入维度定义为4，输出维度定义为3。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nndl import op\n",
    "\n",
    "# 输入维度\n",
    "input_dim = 4\n",
    "# 类别数\n",
    "output_dim = 3\n",
    "# 实例化模型\n",
    "model = op.model_SR(input_dim=input_dim, output_dim=output_dim)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### 3.3.3 模型训练\n",
    "\n",
    "实例化RunnerV2类，使用训练集和验证集进行模型训练，共训练80个epoch，其中每隔10个epoch打印训练集上的指标，并且保存准确率最高的模型作为最佳模型。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best accuracy performence has been updated: 0.00000 --> 0.40000\n",
      "[Train] epoch: 0, loss: 1.09861159324646, score: 0.31666672229766846\n",
      "[Dev] epoch: 0, loss: 1.0828731060028076, score: 0.40000003576278687\n",
      "best accuracy performence has been updated: 0.40000 --> 0.46667\n",
      "best accuracy performence has been updated: 0.46667 --> 0.66667\n",
      "best accuracy performence has been updated: 0.66667 --> 0.73333\n",
      "[Train] epoch: 10, loss: 0.9869819283485413, score: 0.6416667699813843\n",
      "[Dev] epoch: 10, loss: 0.974640965461731, score: 0.73333340883255\n",
      "[Train] epoch: 20, loss: 0.9098353385925293, score: 0.6500000953674316\n",
      "[Dev] epoch: 20, loss: 0.8997330069541931, score: 0.73333340883255\n",
      "[Train] epoch: 30, loss: 0.847343385219574, score: 0.6833333969116211\n",
      "[Dev] epoch: 30, loss: 0.8396621346473694, score: 0.73333340883255\n",
      "[Train] epoch: 40, loss: 0.7956345677375793, score: 0.7250000238418579\n",
      "[Dev] epoch: 40, loss: 0.7903936505317688, score: 0.73333340883255\n",
      "[Train] epoch: 50, loss: 0.7523643970489502, score: 0.7416666746139526\n",
      "[Dev] epoch: 50, loss: 0.7495585083961487, score: 0.73333340883255\n",
      "[Train] epoch: 60, loss: 0.7158005237579346, score: 0.783333420753479\n",
      "[Dev] epoch: 60, loss: 0.7153650522232056, score: 0.73333340883255\n",
      "best accuracy performence has been updated: 0.73333 --> 0.80000\n",
      "[Train] epoch: 70, loss: 0.6845439672470093, score: 0.8083333969116211\n",
      "[Dev] epoch: 70, loss: 0.6863659620285034, score: 0.8000000715255737\n",
      "best accuracy performence has been updated: 0.80000 --> 0.86667\n",
      "[Train] epoch: 80, loss: 0.657481849193573, score: 0.8166667819023132\n",
      "[Dev] epoch: 80, loss: 0.6614803075790405, score: 0.8666667342185974\n",
      "[Train] epoch: 90, loss: 0.6339080333709717, score: 0.8416667580604553\n",
      "[Dev] epoch: 90, loss: 0.64008629322052, score: 0.8666667342185974\n",
      "[Train] epoch: 100, loss: 0.613059937953949, score: 0.8416667580604553\n",
      "[Dev] epoch: 100, loss: 0.6212700009346008, score: 0.8666667342185974\n"
     ]
    }
   ],
   "source": [
    "from nndl import op, metric, opitimizer, RunnerV2\n",
    "\n",
    "# 学习率\n",
    "lr = 0.2\n",
    "\n",
    "# 梯度下降法\n",
    "optimizer = opitimizer.SimpleBatchGD(init_lr=lr, model=model)\n",
    "# 交叉熵损失\n",
    "loss_fn = op.MultiCrossEntropyLoss()\n",
    "# 准确率\n",
    "metric = metric.accuracy\n",
    "\n",
    "# 实例化RunnerV2\n",
    "runner = RunnerV2(model, optimizer, metric, loss_fn)\n",
    "\n",
    "# 启动训练\n",
    "runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=200, log_epochs=10, save_path=\"best_model.pdparams\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "可视化观察训练集与验证集的准确率变化情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nndl import plot\n",
    "\n",
    "plot(runner,fig_name='linear-acc3.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### 3.3.4 模型评价\n",
    "\n",
    "使用测试数据对在训练过程中保存的最佳模型进行评价，观察模型在测试集上的准确率情况。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 加载最优模型\n",
    "runner.load_model('best_model.pdparams')\n",
    "# 模型评价\n",
    "score, loss = runner.evaluate([X_test, y_test])\n",
    "print(\"[Test] score/loss: {:.4f}/{:.4f}\".format(score, loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### 3.3.5 模型预测\n",
    "\n",
    "使用保存好的模型，对测试集中的数据进行模型预测，并取出1条数据观察模型效果。代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 预测测试集数据\n",
    "logits = runner.predict(X_test)\n",
    "# 观察其中一条样本的预测结果\n",
    "pred = paddle.argmax(logits[0]).numpy()\n",
    "# 获取该样本概率最大的类别\n",
    "label = y_test[0].numpy()\n",
    "# 输出真实类别与预测类别\n",
    "print(\"The true category is {} and the predicted category is {}\".format(label[0], pred[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.4 小结\n",
    "\n",
    "本节实现了Logistic回归和Softmax回归两种基本的线性分类模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.5 实验拓展\n",
    "\n",
    "为了加深对机器学习模型的理解，请自己动手完成以下实验：\n",
    "\n",
    "1. 尝试调整学习率和训练轮数等超参数，观察是否能够得到更高的精度；\n",
    "1. 在解决多分类问题时，还有一个思路是将每个类别的求解问题拆分成一个二分类任务，通过判断是否属于该类别来判断最终结果。请分别尝试两种求解思路，观察哪种能够取得更好的结果；\n",
    "1. 尝试使用《神经网络与深度学习》中的其他模型进行鸢尾花识别任务，观察是否能够得到更高的精度。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
